Collecting Legacy Corpora from Social Science Research for Text Mining Evaluation
Author
Abstract
In this poster we describe a pilot study that searched the social science literature for legacy corpora that can be used to evaluate text mining algorithms. The emerging field of computational social science demands large amounts of social science data to train and evaluate computational models. We argue that legacy corpora annotated by social science researchers through traditional Qualitative Data Analysis (QDA) are ideal data sets for evaluating text mining methods such as text categorization and clustering. As a pilot study, we searched leading communication journals for articles that involve content analysis and discourse analysis, and then contacted the authors about the availability of the annotated texts. Regrettably, nearly all of the corpora we found had not been adequately maintained, and many were no longer available even though they were less than ten years old. This situation calls for more effort to maintain and reuse legacy social science data for future computational social science research.
Similar resources
Spatializing a Digital Text Archive about History
The amount of digital text data available in online libraries has risen dramatically in recent years. GoogleBooks or the Universal Digital Library (UDL) initiatives illustrate this impressively. The rapid evolution of vast digital text data archives has spurred the growth of an interdisciplinary Digital Humanities (DH) community, as [1] puts it, the once inaccessible has suddenly...
A System for Building FrameNet-like Corpus for the Biomedical Domain
Semantic Role Labeling (SRL) plays an important role in different text mining tasks. The development of SRL systems for the biomedical area is frustrated by the lack of large-scale domain-specific corpora that are annotated with semantic roles. In our previous work, we proposed a method for building a FrameNet-like corpus for the area using domain knowledge provided by ontologies. In this paper,...
Using Statistical Properties to Enhance Text Categorization
Statistical properties extracted from text are useful in many areas. Identifying the author of a text or determining its category are among the uses of collecting such statistics. In this paper, language-independent properties of text are studied using two categorized corpora of news articles. It is observed that the properties depend neither on the corpus nor on its size. Several interesting...
Designing a System for Trend Analysis of Users in Website Surfing in Iran Using Data Mining and Text Mining Algorithms
Background and Aim: Because web surfing has become part of the lifestyle of a vast majority of people in society, and more accurate social and cultural policy making is needed in this field, the authors set out to analyze how users view different websites, so as to help politicians and practitioners. Methods: The design science research method is used in this research...
A Novel Approach for Sentiment Analysis Using Classifiers Naive Bayes, SVM and Modified K-Means
Sentiments, evaluations, attitudes, and emotions are the subjects of study of sentiment analysis and opinion mining. The inception and rapid growth of the field coincide with those of the social media on the Web, e.g., reviews, forum discussions, blogs, micro blogs, Twitter, and social networks, because for the first time in human history, we have a huge volume of opinionated data recorded in d...
Journal title:
Volume, issue:
Pages: -
Publication date: 2015